A Boyer-Moore Type Algorithm for Compressed Pattern Matching

Authors

  • Yusuke Shibata
  • Tetsuya Matsumoto
  • Masayuki Takeda
  • Ayumi Shinohara
  • Setsuo Arikawa
Abstract

We apply the Boyer–Moore technique to compressed pattern matching for a text string described in terms of a collage system, a formal framework that captures various dictionary-based compression methods. For the subclass of collage systems that contain no truncation, our new algorithm runs in O(‖D‖ + n · m + m + r) time using O(‖D‖ + m) space, where ‖D‖ is the size of the dictionary D, n is the compressed text length, m is the pattern length, and r is the number of pattern occurrences. For a general collage system, the time complexity is O(height(D) · (‖D‖ + n) + n · m + m² + r), where height(D) is the maximum dependency of tokens in D. We show that the algorithm specialized for byte pair encoding (BPE) is very fast in practice: it runs about 1.2 to 3.0 times faster than the exact-match routine of the software package agrep, which is known as the fastest pattern matching tool.
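To make the symbols in the abstract concrete, the following is a minimal sketch of a BPE-style dictionary in which every token is defined as the concatenation of two earlier tokens or characters. It is a toy illustration only, not the authors' algorithm: the names `RULES`, `expand`, `height`, and `dictionary_height` are hypothetical, and counting a plain character as depth 0 is a convention chosen here, which may differ from the paper's definition of height(D).

```python
# Hypothetical toy dictionary D: each token is the concatenation of two
# earlier tokens or plain characters (BPE-style, no truncation).
RULES = {
    "A": ("a", "b"),  # A -> ab
    "B": ("A", "c"),  # B -> abc
    "C": ("B", "A"),  # C -> abcab
}


def expand(token):
    """Return the plain string that a token stands for."""
    if token not in RULES:
        return token  # a plain character expands to itself
    left, right = RULES[token]
    return expand(left) + expand(right)


def height(token):
    """Depth of the dependency chain below a token (plain characters count as 0)."""
    if token not in RULES:
        return 0
    left, right = RULES[token]
    return 1 + max(height(left), height(right))


def dictionary_height(rules):
    """Largest dependency depth over all tokens defined in the dictionary."""
    return max((height(t) for t in rules), default=0)


# A compressed text is a sequence of n tokens; here n = 3.
compressed = ["C", "a", "B"]
text = "".join(expand(t) for t in compressed)
```

In this sketch the compressed text has n = 3 tokens, while the decompressed text has 9 characters. A compressed pattern matcher aims to report pattern occurrences while reading only the token sequence and the dictionary, without building `text` explicitly.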


Related resources

Compressed-Domain Pattern Matching with the Burrows-Wheeler Transform

This report investigates two approaches for online pattern-matching in files compressed with the Burrows-Wheeler transform (Burrows & Wheeler 1994). The first is based on the Boyer-Moore pattern matching algorithm (Boyer & Moore 1977), and the second is based on binary search. The new methods use the special structure of the Burrows-Wheeler transform to achieve efficient, robust pattern matching...


Project 2: Pattern Matching in Compressed DNA Sequence

Space-efficient storage of large genome sequences requires good compression techniques. However, if these sequences must be decompressed before any processing can be done on them, the advantage of compression is lost. New techniques are required to extend the traditional pattern matching algorithms to work directly on the compressed sequence. This saves space in memory, requires less disk...


Accelerating Boyer Moore Searches on Binary Texts

The Boyer and Moore (BM) pattern matching algorithm is considered one of the best, but its performance is reduced on binary data. Yet searching in binary texts has important applications, such as compressed matching. The paper shows how, by means of some pre-computed tables, one may implement the BM algorithm for the binary case as well, without referring to bits, and processing only entire blo...


Boyer - Moore String Matching over Ziv -

We present a Boyer-Moore approach to string matching over LZ78 and LZW compressed text. The key idea is that, even though we cannot choose exactly which text characters to inspect, we can still use the characters explicitly represented in those formats to shift the pattern in the text. We present a basic approach and more advanced ones. Although the theoretical average complexity does not ...


Searching BWT Compressed Text with the Boyer-Moore Algorithm and Binary Search

This paper explores two techniques for on-line exact pattern matching in files that have been compressed using the Burrows-Wheeler transform. We investigate two approaches. The first is an application of the Boyer-Moore algorithm (Boyer & Moore 1977) to a transformed string. The second approach is based on the observation that the transform effectively contains a sorted list of all substrings o...




Publication date: 2000